[SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accumulator #2524
Conversation
@mateiz @mridulm @kayousterhout @markhamstra @pwendell @JoshRosen I propose this as a resubmission of #228. Expecting your review.
Test FAILed.
be10b9b
to
13a190e
Compare
QA tests have started for PR 2524 at commit
QA tests have finished for PR 2524 at commit
Test FAILed.
13a190e
to
9fbe39a
Compare
QA tests have started for PR 2524 at commit
QA tests have finished for PR 2524 at commit
Test FAILed.
OK... I will make MIMA happy.
Test FAILed.
0ef91fc
to
af7ff02
Compare
QA tests have started for PR 2524 at commit
QA tests have finished for PR 2524 at commit
Test PASSed.
BTW, if we don't want to de-duplicate in shuffle stages, we can just move the necessary part to TaskSetManager.
Let's not de-duplicate in shuffle stages, please. That complicates the patch a lot, and I'm not sure why people would necessarily use it. Also, why did you add a duplicate flag to Accumulator? IMO we shouldn't expose this as an option; again, it adds complexity to what should just be a bug fix.
Basically, it would be great to get a really simple patch that only fixes SPARK-3628 and adds no new data structures in DAGScheduler.
The drawback of not de-duplicating in shuffle stages is that it makes accumulator usage very tricky: it effectively discourages using an accumulator in a transformation, especially when the involved stage is shared by multiple jobs or the cluster is not stable. As for the flag, it just gives the user the flexibility to choose whether to accept duplicate updates.
I can simply monitor the accumulator updates in TaskSetManager; I'm just not sure whether that fully resolves the problem.
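The failure mode being debated can be shown without Spark at all. The sketch below uses plain Scala collections and made-up names (`AccumulatorDedup`, `dedupUpdate`): when a stage is resubmitted, its tasks run again, and a naive driver merges the same partition's contribution twice, while remembering which (stageId, partitionId) pairs were already merged keeps the count intuitive.

```scala
import scala.collection.mutable

// Minimal sketch (not Spark's real classes) of duplicate accumulator
// updates: a counter accumulator plus a record of which
// (stageId, partitionId) results have already been merged.
object AccumulatorDedup {
  var counter: Long = 0L
  private val merged = mutable.HashSet[(Int, Int)]()

  // Naive merge: every successful task bumps the counter, so a rerun
  // task is counted twice.
  def naiveUpdate(delta: Long): Unit = counter += delta

  // De-duplicated merge: each (stageId, partitionId) pair is merged at
  // most once, so a resubmitted stage cannot double-count.
  def dedupUpdate(stageId: Int, partitionId: Int, delta: Long): Unit =
    if (merged.add((stageId, partitionId))) counter += delta
}
```

For example, if two partitions each contribute 5 and partition 1 is rerun after a fetch failure, the naive path yields 15 while the de-duplicated path keeps the expected 10.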
It's probably easiest to move the accumulator update to TaskSetManager, or to the part of DAGScheduler that reports the result to the user. It's right below the current update in the code:
That happens only once per task, so it's a good place to do the update for ResultTask. For ShuffleMapTask you can do it in the corresponding match statement as well.
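A hedged sketch of the placement suggested here: the handler that reports results to the user runs once per task, so the merge can live in its match statement. All names below (`CompletedTask`, `CompletionHandler`, etc.) are simplified stand-ins, not the real DAGScheduler or CompletionEvent classes.

```scala
import scala.collection.mutable

// Simplified stand-ins for the task types matched on in the real
// completion handler.
sealed trait CompletedTask { def stageId: Int; def partitionId: Int }
case class ResultTaskDone(stageId: Int, partitionId: Int) extends CompletedTask
case class ShuffleMapTaskDone(stageId: Int, partitionId: Int) extends CompletedTask

object CompletionHandler {
  var accumValue: Long = 0L
  private val reportedPartitions = mutable.HashSet[(Int, Int)]()

  def handle(task: CompletedTask, update: Long): Unit = task match {
    case t: ResultTaskDone =>
      // Merge only the first completion of each result partition, even
      // if the stage is resubmitted and the task runs again.
      if (reportedPartitions.add((t.stageId, t.partitionId))) accumValue += update
    case _: ShuffleMapTaskDone =>
      // The corresponding branch where shuffle stages could be handled;
      // the patch deliberately leaves these un-deduplicated.
      accumValue += update
  }
}
```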
@@ -112,6 +112,10 @@ class DAGScheduler(
   // stray messages to detect.
   private val failedEpoch = new HashMap[String, Long]

   // stageId => (SplitId -> (accumulatorId, accumulatorValue))
   private[scheduler] val stageIdToAccumulators = new HashMap[Int,
This may cause a memory leak?
How?
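The leak concern is that a stageId-keyed map grows without bound unless entries are dropped once no job needs the stage. The sketch below mirrors the shape of the map in the diff with simplified types (Long values only) and an illustrative `cleanupStage` helper that is not Spark's actual API.

```scala
import scala.collection.mutable

// Sketch of the memory-leak concern: per-stage accumulator updates
// keyed by stage id, with explicit cleanup.
object StageAccumulatorState {
  // stageId -> (accumulatorId -> accumulatorValue)
  val stageIdToAccumulators =
    mutable.HashMap.empty[Int, mutable.HashMap[Long, Long]]

  def record(stageId: Int, accumId: Long, value: Long): Unit =
    stageIdToAccumulators
      .getOrElseUpdate(stageId, mutable.HashMap.empty)
      .update(accumId, value)

  // Without a call like this when a stage becomes independent, every
  // stage ever run keeps an entry in the map: the leak being pointed out.
  def cleanupStage(stageId: Int): Unit =
    stageIdToAccumulators.remove(stageId)
}
```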
I think it should work... I'm trying this.
QA tests have started for PR 2524 at commit
QA tests have finished for PR 2524 at commit
Test PASSed.
OK, Jenkins said OK. Finished the modification.
@@ -901,6 +900,33 @@ class DAGScheduler(
     }
   }

   private def updateAccumulator(event: CompletionEvent): Unit = {
Call this updateAccumulators and add a comment saying:
/** Merge updates from a task to our local accumulator values */
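A sketch of the renamed helper, with Spark's Accumulable hierarchy replaced by a plain registry of Long-valued accumulators (`DriverAccumulators` and its field names are illustrative); the real method works on the task's completion event rather than a bare map.

```scala
import scala.collection.mutable

object DriverAccumulators {
  // accumulator id -> current driver-side value
  val values = mutable.HashMap.empty[Long, Long]

  /** Merge updates from a task to our local accumulator values */
  def updateAccumulators(taskUpdates: Map[Long, Long]): Unit =
    for ((id, partial) <- taskUpdates)
      values(id) = values.getOrElse(id, 0L) + partial
}
```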
@CodingCat thanks for the update, this looks good. I just made a few small comments.
Test build #23893 has started for PR 2524 at commit
Test build #23893 has finished for PR 2524 at commit
Test FAILed.
Hey @mateiz, thank you very much for the review. I addressed all of the comments except the "lastId" one, as MIMA wants me to keep it since it's public. Also, a question for you: shall I submit the patch to the old version branches? There are some merge conflicts preventing the patch from applying there directly.
Test build #23895 has started for PR 2524 at commit
Test build #23895 has finished for PR 2524 at commit
Test PASSed.
Can you just not change Accumulator.scala then? That change isn't fixing any kind of bug, it's just a small optimization. Just remove it from this patch.
Test build #23908 has started for PR 2524 at commit
@mateiz sure, I've just rolled back the changes. How about the question of applying the patch to the other branches?
Don't worry about the other branches now; we can figure it out if we want to backport it.
Test build #23908 has finished for PR 2524 at commit
Test PASSed.
Alright, thanks! I've merged this in.
… accumulator

https://issues.apache.org/jira/browse/SPARK-3628

In the current implementation, the accumulator is updated for every successfully finished task, even if the task comes from a resubmitted stage, which makes the accumulator counter-intuitive.

In this patch, I changed the way the DAGScheduler updates the accumulator. The DAGScheduler maintains a hash table mapping each stage id to the received <accumulator_id, value> pairs. Only when the stage becomes independent (no job needs it any more) do we accumulate the values of the <accumulator_id, value> pairs. When a task finishes, we check whether the hash table already contains that stage id; we save the (accumulator_id, value) pair only when the task is the first finished task of a new stage or the stage is running its first attempt.

Author: CodingCat <zhunansjtu@gmail.com>

Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:

701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make allowing duplicate updates an option of accumulator

(cherry picked from commit 5af53ad)
Signed-off-by: Matei Zaharia <matei@databricks.com>
@mateiz @CodingCat Apologies, but can I confirm that the scope of this change is strictly to ensure that actions/result stages never duplicate accumulator updates? The PR title and description are more general than this, but the associated JIRAs suggest the restricted scope.
Yes. Originally I tried to do it for both ShuffleMapTask and ResultTask, but later @mateiz convinced me that we actually cannot handle the transformation case, so the current change only involves result tasks. Apologies for not updating the PR title in time.
Yes, it should be only SPARK-3628. |